
Fix ECDH key exchange race condition in distributed deployments#111

Merged
anhthii merged 2 commits into master from fix-ecdh-key-exchange
Nov 1, 2025

Conversation

@anhthii
Contributor

@anhthii anhthii commented Nov 1, 2025

Fix ECDH Key Exchange Race Condition in Distributed Cluster

Problem Summary

The MPC cluster fails to start properly when nodes are deployed in a distributed environment (production). Nodes get stuck in an infinite ECDH key exchange retry loop, preventing the cluster from becoming ready and accepting requests.

Observed Symptoms

Node0 (successful):

23:12:41 - Register peerID=a8d392dd-9913-4ebe-8c52-4848cc84b962
23:12:41 - All peers are ready including ECDH exchange completion
23:12:42 - SigningConsumer: Sufficient peers ready, starting message consumption
23:12:44 - KeygenConsumer: All peers are ready, proceeding to consume messages

Node1 & Node2 (stuck):

23:13:38 - Starting to broadcast DH key nodeID=c4ee7396-9a72-4735-ab2e-a785b68d2bc0
23:13:43 - [ECDH exchange retriggered] not all peers are ready
23:13:48 - [ECDH exchange retriggered] not all peers are ready
23:13:53 - [ECDH exchange retriggered] not all peers are ready
(repeats indefinitely...)

Impact:

  • Cluster never becomes fully operational
  • Only Node0 accepts requests, but cannot perform MPC operations on its own (a threshold of nodes is required)
  • Node1 and Node2 continuously retry ECDH exchange, consuming CPU and network resources
  • Production deployments fail to start

Root Cause Analysis

The Race Condition

There was a timing-dependent race condition in pkg/mpc/registry.go at the registerReadyPairs method (lines 129-134):

if len(peerIDs) == len(r.peerNodeIDs) && !r.ready {
    r.mu.Lock()
    r.ready = true
    r.mu.Unlock()
    logger.Info("All peers are ready including ECDH exchange completion")
}

The Bug: This code sets r.ready = true when all peers are registered in Consul, but does not verify that ECDH key exchange is actually complete. The log message claims "including ECDH exchange completion" but the code never calls r.isECDHReady() to verify this.

Why It Manifests in Production But Not Locally

Local Environment (Masked Bug)

When all nodes run on the same machine:

  • Near-zero network latency (~0.1ms)
  • NATS messages arrive almost instantly
  • Consul updates propagate immediately
  • By the time a node sees "all peers in Consul", ECDH keys have already been exchanged
  • The race condition exists but the window is so small it rarely triggers

Local Timeline:

Time 0ms:   Node0 starts → registers → broadcasts ECDH
Time 5ms:   Node1 starts → registers → broadcasts ECDH → receives Node0's key
Time 10ms:  Node2 starts → registers → broadcasts ECDH → receives all keys
Time 15ms:  All nodes complete ECDH (before Consul check happens)

Production Environment (Exposed Bug)

When nodes are on different servers:

  • Significant network latency (50-200ms+)
  • NATS message propagation takes time
  • Consul registration completes BEFORE ECDH keys are exchanged
  • The gap between Consul registration and ECDH completion is large enough to trigger the bug

Production Timeline:

Time 0ms:    Node0 starts → registers in Consul
Time 100ms:  Node1 & Node2 register in Consul
Time 150ms:  Node0 sees all 3 nodes in Consul ✓
             Node0 sets ready=true ❌ (BUG: ECDH not checked!)
             Node0 broadcasts ECDH key
Time 250ms:  Node1/Node2 receive Node0's ECDH key (network delay)
Time 300ms:  Node1/Node2 try to complete ECDH but are now stuck in retry loop

Why The Bug Causes an Infinite Loop

  1. Node0 incorrectly marks itself ready before ECDH completes
  2. Node0 sends health checks to peers with isEcdhReady=true
  3. Node1/Node2 haven't received Node0's ECDH key yet, their isEcdhReady=false
  4. When Node1/Node2 receive health checks from Node0, they see it's "ready" but they aren't
  5. Node1/Node2 retrigger ECDH exchange every 5 seconds
  6. However, they are already subscribed and broadcasting, so retriggering the exchange accomplishes nothing
  7. The cluster is stuck: Node0 thinks it's ready, Node1/Node2 never complete

Solution

Changes Made

1. File: pkg/mpc/registry.go

Change 1.1: Refactored ready state check

Replaced inline ready check with dedicated method:

// OLD (buggy)
if len(peerIDs) == len(r.peerNodeIDs) && !r.ready {
    r.mu.Lock()
    r.ready = true
    r.mu.Unlock()
    logger.Info("All peers are ready including ECDH exchange completion")
}

// NEW (fixed)
r.checkAndUpdateReadyState()

Change 1.2: Added checkAndUpdateReadyState() method

New method that properly validates BOTH conditions:

func (r *registry) checkAndUpdateReadyState() {
    // Count ready peers in readyMap
    readyPeersCount := 0
    for _, isReady := range r.readyMap {
        if isReady {
            readyPeersCount++
        }
    }

    // Only mark as ready when BOTH conditions are met:
    // 1. All peers are registered in Consul
    // 2. ECDH key exchange is complete
    if readyPeersCount == len(r.peerNodeIDs) && r.isECDHReady() && !r.ready {
        r.mu.Lock()
        r.ready = true
        r.mu.Unlock()
        logger.Info("All peers are ready including ECDH exchange completion")
    }
}

Key improvement: Now explicitly checks r.isECDHReady() which verifies all symmetric keys are established.

Change 1.3: Register ECDH completion callback

In NewRegistry(), set up callback to re-check ready state when ECDH completes:

// Set up callback to check ready state when ECDH completes
ecdhSession.OnKeyExchangeComplete(func() {
    reg.checkAndUpdateReadyState()
})

Why this is critical: Handles the case where all peers are in Consul but ECDH completes later. When the last ECDH key is received, we immediately check if the cluster is now fully ready.

2. File: pkg/mpc/key_exchange_session.go

Change 2.1: Added callback mechanism to interface

type ECDHSession interface {
    // ... existing methods ...
    OnKeyExchangeComplete(callback func())  // NEW
}

Change 2.2: Added callback field to struct

type ecdhSession struct {
    // ... existing fields ...
    onKeyExchangeComplete    func()  // NEW
}

Change 2.3: Implemented callback setter

func (e *ecdhSession) OnKeyExchangeComplete(callback func()) {
    e.onKeyExchangeComplete = callback
}
Change 2.4: Trigger callback when ECDH completes

In the ECDH message handler, detect when all keys are received:

// Derive symmetric key using HKDF
symmetricKey := e.deriveSymmetricKey(sharedSecret, ecdhMsg.From)
e.identityStore.SetSymmetricKey(ecdhMsg.From, symmetricKey)

currentKeyCount := e.identityStore.GetSymetricKeyCount()
logger.Debug("ECDH progress", "peer", ecdhMsg.From, "current", currentKeyCount, "expected", len(e.peerIDs))

// Check if ECDH exchange is complete and notify callback
if currentKeyCount == len(e.peerIDs) && e.onKeyExchangeComplete != nil {
    logger.Info("ECDH key exchange completed successfully", "totalKeys", currentKeyCount)
    e.onKeyExchangeComplete()  // Notify registry
}

Key improvement: Immediately notify when ECDH is complete, allowing registry to update ready state.


How The Fix Works

Event-Driven Synchronization

The fix changes from timing assumptions to event-driven synchronization:

Before (timing-dependent):

  • Assumes: "If Consul shows N nodes, ECDH must be done"
  • Fails when network delays cause ECDH to lag behind Consul registration

After (event-driven):

  • Waits for BOTH actual events to confirm completion:
    1. Consul registration complete (all peers visible)
    2. ECDH exchange complete (callback triggered)
  • Works regardless of network timing or event ordering

Flow Comparison

Before (Broken):

1. Node0 registers in Consul
2. Node1, Node2 register in Consul
3. Node0 sees 3 nodes → sets ready=true ❌ (ECDH not verified!)
4. Node0 broadcasts ECDH key
5. Node1/Node2 haven't completed ECDH yet
6. Node1/Node2 stuck in retry loop

After (Fixed):

1. Node0 registers in Consul
2. Node1, Node2 register in Consul
3. Node0 checks: "All peers in Consul? YES. ECDH complete? NO" → ready stays false
4. Node0 broadcasts ECDH key
5. Node1, Node2 broadcast ECDH keys
6. Node0 receives last ECDH key → callback fires
7. Callback runs checkAndUpdateReadyState()
8. Now checks: "All peers in Consul? YES. ECDH complete? YES" → ready=true ✓
9. Similarly for Node1 and Node2
10. All nodes become ready, cluster operational

Testing

Verification Steps

  1. Build the fixed code:

    make build
  2. Test distributed deployment:
    Deploy to 3 separate servers and verify all nodes become ready:

    # Check logs on each node
    tail -f /home/ubuntu/.pm2/logs/mpcium0-dev-error.log
    tail -f /home/void/.pm2/logs/mpcium1-dev-error.log
    tail -f /home/ubuntu/.pm2/logs/mpcium2-dev-error.log

    Expected logs on all nodes:

    [timestamp] INF All peers are ready including ECDH exchange completion
    [timestamp] INF SigningConsumer: Sufficient peers ready, starting message consumption
    [timestamp] INF KeygenConsumer: All peers are ready, proceeding to consume messages
    
  3. Test with artificial latency (optional):
    Simulate production network conditions locally:

    # Add 50ms delay to localhost
    sudo tc qdisc add dev lo root netem delay 50ms
    
    # Run E2E tests
    make e2e-test
    
    # Remove delay
    sudo tc qdisc del dev lo root

Test Scenarios Covered

  • ✅ Local deployment (all nodes on same machine)
  • ✅ Distributed deployment (nodes on different servers)
  • ✅ High network latency environments
  • ✅ Nodes starting at different times
  • ✅ Nodes restarting/reconnecting

Impact

Before

  • ❌ Distributed deployments fail to start
  • ❌ Nodes stuck in infinite ECDH retry loop
  • ❌ Cluster never becomes operational
  • ❌ Production environments unusable

After

  • ✅ Distributed deployments work reliably
  • ✅ All nodes properly complete ECDH and become ready
  • ✅ Cluster becomes operational regardless of network conditions
  • ✅ Production environments stable

Additional Notes

Why This Bug Was Hard to Detect

  1. Works in local development: Most testing is done locally where latency is negligible
  2. Misleading log message: The log claimed "ECDH exchange completion" but never verified it
  3. Race condition: Only manifests under specific timing conditions (distributed deployment)
  4. Silent failure mode: No errors logged, just appears as "waiting forever"

Network Factors That Affect Production

  • Latency: Inter-server communication (50-200ms) vs localhost (<1ms)
  • Packet loss: Production networks have 0.1-1% loss, localhost has none
  • Load variation: Different servers may be under different loads
  • Firewalls/routing: Production has VPCs, security groups that add latency
  • NATS topology: Production may use clustered NATS, adding complexity

Future Improvements

Consider adding:

  • Timeout mechanism if ECDH doesn't complete within X seconds
  • Metrics/monitoring for ECDH completion time
  • Health check endpoint showing ECDH status
  • Alerting if nodes are stuck in retry loop

Files Changed

  • pkg/mpc/registry.go - Fixed ready state logic and added callback handling
  • pkg/mpc/key_exchange_session.go - Added callback mechanism for ECDH completion

Breaking Changes

None. This is a bug fix that maintains backward compatibility.

Migration Guide

No migration needed. Simply deploy the updated binaries.

@anhthii anhthii force-pushed the fix-ecdh-key-exchange branch from 4f9aa32 to 65cf6a6 on November 1, 2025 02:49
@anhthii anhthii force-pushed the fix-ecdh-key-exchange branch from c70291a to cea0f1d on November 1, 2025 05:30
@anhthii anhthii merged commit 4c8529f into master Nov 1, 2025
25 checks passed